
Study Notes on Microsoft's GPT-2 Model Source Code


Related links: the GPT-2 paper, and the Microsoft DeepSpeed GPT-2 source code.

The GPT-2 code integrated into Microsoft DeepSpeed feels much more readable than the huggingface implementation. I use it here purely to study the code structure, and for now I ignore the model-sharding / parallelism parts.

(Even though that feels like skipping the best part... Orz)

Contents

1. GPT-2 Model Overview
2. Reading the GPT-2 Code Modules
   2.1 GPT2Model Main Module
   2.2 GPT2Transformer Module
   2.3 GPT2TransformerLayer Module
   2.4 GPT2SelfAttention Module
   2.5 GPT2MLP Module
3. GPT-2 Model Pretraining
   3.1 GPT2 Pretraining - Building the Model
   3.2 GPT2 Pretraining - forward
References

1. GPT-2 Model Overview

GPT-2 is a pretrained language model released in 2019, trained on more than 40 GB of text scraped from roughly 8 million web pages.

GPT-2 can be understood as a stack of transformer decoder blocks, whose input is word embeddings + position embeddings. Each transformer block processes a token as follows: the token first goes through the self-attention layer and is then passed to the feed-forward network. Once the first transformer block has finished processing the token, the resulting vector is passed to the next block in the stack and the computation continues. Every block performs the same computation, but each one maintains its own weights for the self-attention layer and the feed-forward network. A minimal sketch of this flow is shown right after this paragraph.
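To make the data flow concrete, here is a tiny self-contained sketch of a decoder-only model in plain PyTorch. All names and sizes (TinyDecoderLM, hidden=64, and so on) are my own toy choices rather than anything from the DeepSpeed code, and torch.nn.TransformerEncoderLayer with a causal mask is used only as a stand-in for a decoder block.

import torch

class TinyDecoderLM(torch.nn.Module):
    """Toy decoder-only LM: token + position embeddings -> N identical blocks."""

    def __init__(self, vocab_size=100, max_len=32, hidden=64, heads=4, num_layers=2):
        super().__init__()
        self.wte = torch.nn.Embedding(vocab_size, hidden)   # word embeddings
        self.wpe = torch.nn.Embedding(max_len, hidden)      # position embeddings
        block = torch.nn.TransformerEncoderLayer(hidden, heads,
                                                 dim_feedforward=4 * hidden,
                                                 batch_first=True, norm_first=True)
        # Every block runs the same computation but keeps its own weights.
        self.blocks = torch.nn.TransformerEncoder(block, num_layers)

    def forward(self, input_ids):                            # [b, s]
        s = input_ids.size(1)
        positions = torch.arange(s, device=input_ids.device)
        x = self.wte(input_ids) + self.wpe(positions)        # [b, s, h]
        # Causal mask: a position may only attend to itself and earlier positions.
        causal = torch.triu(torch.full((s, s), float('-inf')), diagonal=1)
        return self.blocks(x, mask=causal)                   # [b, s, h] hidden states

hidden_states = TinyDecoderLM()(torch.randint(0, 100, (2, 10)))
print(hidden_states.shape)   # torch.Size([2, 10, 64])

The projection from the final hidden states back to vocabulary logits (which reuses the word-embedding matrix) appears in the GPT2Model code in Section 2.1.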

2. Reading the GPT-2 Code Modules

The GPT-2 code modules are quite readable; the overall structure is as follows:

[Figure: overall module structure of the GPT-2 code]

2.1 GPT2Model Main Module

The GPT2Model main module, with my comments added:

class GPT2Model(torch.nn.Module):
    """GPT-2 Language model.

    The output of the forward method is the logits (parallel or serial
    depending on the `parallel_output` flag).
    """

    def __init__(self,
                 num_layers,
                 vocab_size,
                 hidden_size,
                 num_attention_heads,
                 embedding_dropout_prob,
                 attention_dropout_prob,
                 output_dropout_prob,
                 max_sequence_length,
                 checkpoint_activations,
                 checkpoint_num_layers=1,
                 parallel_output=True):
        super(GPT2Model, self).__init__()

        self.parallel_output = parallel_output

        init_method = init_method_normal(std=0.02)

        # Word embeddings (parallel).
        # Word-embedding table of shape vocab_size * hidden_size, used for lookup embeddings.
        self.word_embeddings = mpu.VocabParallelEmbedding(
            vocab_size, hidden_size, init_method=init_method)

        # Position embedding (serial).
        # Position-embedding table of shape max_sequence_length * hidden_size, used to look up
        # an embedding for every position; this is an absolute positional encoding.
        self.position_embeddings = torch.nn.Embedding(max_sequence_length,
                                                      hidden_size)
        # Initialize the position embeddings.
        init_method(self.position_embeddings.weight)

        # Embeddings dropout
        self.embedding_dropout = torch.nn.Dropout(embedding_dropout_prob)

        # Transformer
        # Build the transformer module (discussed in detail below).
        self.transformer = mpu.GPT2ParallelTransformer(num_layers,          # number of transformer layers
                                                       hidden_size,
                                                       num_attention_heads, # number of attention heads
                                                       attention_dropout_prob,
                                                       output_dropout_prob,
                                                       checkpoint_activations,
                                                       checkpoint_num_layers)

    def forward(self, input_ids, position_ids, attention_mask):
        # Embeddings.
        # Look up token embeddings from the input ids.
        words_embeddings = self.word_embeddings(input_ids)
        # Look up position embeddings from the position ids.
        position_embeddings = self.position_embeddings(position_ids)
        # The actual input is the sum of the token and position embeddings.
        embeddings = words_embeddings + position_embeddings

        # Dropout.
        embeddings = self.embedding_dropout(embeddings)

        # Transformer.
        # Feed the embeddings and the mask into the transformer.
        transformer_output = self.transformer(embeddings, attention_mask)

        # Parallel logits.
        # Logits computed in model-parallel fashion.
        transformer_output_parallel = mpu.copy_to_model_parallel_region(
            transformer_output)
        logits_parallel = F.linear(transformer_output_parallel,
                                   self.word_embeddings.weight)

        if self.parallel_output:
            return logits_parallel

        return mpu.gather_from_model_parallel_region(logits_parallel)
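One detail worth pausing on: the output projection has no weight matrix of its own. F.linear(transformer_output, self.word_embeddings.weight) reuses the embedding table, so the logits dimension equals the vocabulary size. A toy shape check (all sizes are made up for illustration):

import torch
import torch.nn.functional as F

batch, seq_len, hidden, vocab = 2, 8, 16, 100        # toy sizes
wte = torch.nn.Embedding(vocab, hidden)              # embedding table [vocab, hidden]
transformer_output = torch.randn(batch, seq_len, hidden)

# F.linear(x, W) computes x @ W.T, so [b, s, h] x [vocab, h].T -> [b, s, vocab].
logits = F.linear(transformer_output, wte.weight)
print(logits.shape)                                  # torch.Size([2, 8, 100])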

2.2 GPT2Transformer Module

The GPT2ParallelTransformer module lives in mpu/transformer.py. mpu is the model-parallel framework, which wraps the parallel-training code for both BERT and GPT-2. Here I only look at the parts relevant to the underlying principles and ignore the parallelism for now.

This module is the backbone of the model: it packs n transformer blocks together, i.e. it consists of two parts, n * transformer layer plus a final layer norm.

The code for a single transformer layer is covered in Section 2.3. A sketch of the activation-checkpointing chunking used in this module's forward comes right after this paragraph.
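The idea behind checkpoint_activations is that the layers are run in chunks of checkpoint_num_layers, and each chunk's activations are recomputed during backward instead of being stored. The sketch below uses torch.utils.checkpoint purely as an illustration; the original code swaps in deepspeed.checkpointing.checkpoint when DeepSpeed checkpointing is configured.

import torch
from torch.utils.checkpoint import checkpoint

def chunked_forward(layers, hidden_states, attention_mask, chunk_length=1):
    """Run `layers` in chunks, recomputing each chunk's activations in backward."""

    def custom(start, end):
        # Package layers[start:end] into a single function so one checkpoint
        # call covers the whole chunk.
        def custom_forward(x, mask):
            for layer in layers[start:end]:
                x = layer(x, mask)
            return x
        return custom_forward

    l, num_layers = 0, len(layers)
    while l < num_layers:
        hidden_states = checkpoint(custom(l, l + chunk_length),
                                   hidden_states, attention_mask)
        l += chunk_length
    return hidden_states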

class GPT2ParallelTransformer(torch.nn.Module):
    """GPT-2 transformer.

    This module takes input from embedding layer and its output can
    be used directly by a logit layer. It consists of L (num-layers)
    blocks of:
        layer norm
        self attention
        residual connection
        layer norm
        mlp
        residual connection
    followed by a final layer norm.

    Arguments:
        num_layers: Number of transformer layers.
        hidden_size: The hidden size of the self attention.
        num_attention_heads: number of attention heads in the self attention.
        attention_dropout_prob: dropout probability of the attention score
                                in self attention.
        output_dropout_prob: dropout probability for the outputs after self
                             attention and final output.
        checkpoint_activations: if True, checkpoint activations.
        checkpoint_num_layers: number of layers to checkpoint. This is
                               basically the chunk size in checkpointing.
        layernorm_epsilon: epsilon used in layernorm to avoid division by zero.
        init_method_std: standard deviation of the init method which has
                         the form N(0, std).
        use_scaled_init_for_output_weights: If True, use 1/sqrt(2*num_layers)
                                            scaling for the output weights
                                            (output of self attention and mlp).
    """

    def __init__(self,
                 num_layers,
                 hidden_size,
                 num_attention_heads,
                 attention_dropout_prob,
                 output_dropout_prob,
                 checkpoint_activations,
                 checkpoint_num_layers=1,
                 layernorm_epsilon=1.0e-5,
                 init_method_std=0.02,
                 use_scaled_init_for_output_weights=True,
                 sparse_attention_config=None,
                 max_seq_length=None):
        super(GPT2ParallelTransformer, self).__init__()
        # Store activation checkpointing flag.
        self.checkpoint_activations = checkpoint_activations
        self.checkpoint_num_layers = checkpoint_num_layers

        output_layer_init_method = None
        if use_scaled_init_for_output_weights:
            output_layer_init_method = scaled_init_method(init_method_std,
                                                          num_layers)

        # Returns a single transformer layer (discussed below).
        def get_layer():
            return GPT2ParallelTransformerLayer(
                hidden_size,
                num_attention_heads,
                attention_dropout_prob,
                output_dropout_prob,
                layernorm_epsilon,
                unscaled_init_method(init_method_std),
                output_layer_init_method=output_layer_init_method,
                sparse_attention_config=sparse_attention_config,
                max_seq_length=max_seq_length)

        # Transformer layers.
        # Build num_layers transformer layers.
        self.layers = torch.nn.ModuleList(
            [get_layer() for _ in range(num_layers)])

        # Final layer norm before output.
        self.final_layernorm = LayerNorm(hidden_size, eps=layernorm_epsilon)

        if deepspeed.checkpointing.is_configured():
            global get_cuda_rng_tracker, checkpoint
            get_cuda_rng_tracker = deepspeed.checkpointing.get_cuda_rng_tracker
            checkpoint = deepspeed.checkpointing.checkpoint

    def forward(self, hidden_states, attention_mask):

        def custom(start, end):
            # custom() packages layers[start:end] into a single function so
            # that activation checkpointing can treat the whole chunk as one
            # recomputable unit.
            def custom_forward(*inputs):
                layers_ = self.layers[start:end]
                x_ = inputs[0]
                for layer in layers_:
                    x_ = layer(x_, inputs[1])
                return x_
            return custom_forward

        if self.checkpoint_activations:
            # Run the layers chunk by chunk, checkpointing each chunk.
            l = 0
            num_layers = len(self.layers)
            chunk_length = self.checkpoint_num_layers
            while l < num_layers:
                hidden_states = checkpoint(custom(l, l + chunk_length),
                                           hidden_states, attention_mask)
                l += chunk_length
        else:
            # Otherwise just run the layers one after another.
            for layer in self.layers:
                hidden_states = layer(hidden_states, attention_mask)

        # Final layer norm.
        output = self.final_layernorm(hidden_states)

        return output

2.3 GPT2TransformerLayer Module

Each GPT2ParallelTransformerLayer is one pre-layer-norm transformer block, with exactly the per-block structure listed in the docstring above: layer norm, self attention, residual connection, layer norm, mlp, residual connection. The second half of its forward applies the MLP (h -> 4*h -> h, see Section 2.5) to the normalized hidden states and adds the second residual connection (a stand-alone sketch of the full block follows below):

    # MLP: h -> 4*h -> h.
    mlp_output = self.mlp(layernorm_output)
    # Second residual connection.
    output = layernorm_input + mlp_output
    return output
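To see the whole block in one place, here is a minimal stand-in written from the structure listed above. The submodule names (input_layernorm, attention, post_attention_layernorm, mlp) and the plain-PyTorch attention/MLP are my own stand-ins, not the mpu implementation.

import torch

class PreLNTransformerBlock(torch.nn.Module):
    """Minimal stand-in for GPT2ParallelTransformerLayer (no model parallelism)."""

    def __init__(self, hidden_size, num_heads, eps=1e-5):
        super().__init__()
        self.input_layernorm = torch.nn.LayerNorm(hidden_size, eps=eps)
        self.attention = torch.nn.MultiheadAttention(hidden_size, num_heads,
                                                     batch_first=True)
        self.post_attention_layernorm = torch.nn.LayerNorm(hidden_size, eps=eps)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(hidden_size, 4 * hidden_size),
            torch.nn.GELU(),
            torch.nn.Linear(4 * hidden_size, hidden_size))

    def forward(self, hidden_states, causal_mask):
        # Layer norm at the beginning of the block.
        layernorm_output = self.input_layernorm(hidden_states)
        # Self-attention on the normalized input (causal_mask: [s, s], True = blocked).
        attention_output, _ = self.attention(layernorm_output, layernorm_output,
                                             layernorm_output, attn_mask=causal_mask)
        # First residual connection, added to the un-normalized input.
        layernorm_input = hidden_states + attention_output
        # Layer norm after the self-attention.
        layernorm_output = self.post_attention_layernorm(layernorm_input)
        # MLP: h -> 4*h -> h, then the second residual connection.
        mlp_output = self.mlp(layernorm_output)
        return layernorm_input + mlp_output

block = PreLNTransformerBlock(hidden_size=64, num_heads=4)
x = torch.randn(2, 10, 64)
mask = torch.triu(torch.ones(10, 10, dtype=torch.bool), diagonal=1)  # True = blocked
print(block(x, mask).shape)   # torch.Size([2, 10, 64])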

2.4 GPT2SelfAttention Module

This is GPT-2's self-attention module. I again ignore the model-parallel parts and match the code against the underlying principle only. (The bits that involve parallel sharding still deserve a quick look, though.)

This is the core of the model: essentially the multi-head self-attention computation plus the attention mask. The comments in this part are fairly detailed, and reading them alongside the original author's docstrings makes the code easier to follow. A small sketch of how the causal (left-to-right) mask works comes right after this paragraph.
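Before the code, here is a minimal, self-contained sketch of the left-to-right mask trick used in the forward below: ltor_mask is a lower-triangular matrix of ones, the scores are multiplied by it elementwise, and the masked-out positions get a large negative constant so they vanish after the softmax (toy sizes, plain PyTorch rather than the mpu code):

import torch

s = 4                                   # toy sequence length
scores = torch.randn(1, 1, s, s)        # raw attention scores [b, np, s, s]

# Lower-triangular causal mask: 1 where attending is allowed, 0 above the diagonal.
ltor_mask = torch.tril(torch.ones(1, 1, s, s))

# Keep the allowed scores, push the rest to a large negative value ...
masked_scores = scores * ltor_mask - 10000.0 * (1.0 - ltor_mask)
# ... so that softmax assigns them (almost) zero probability.
probs = torch.softmax(masked_scores, dim=-1)
print(probs[0, 0])   # row i only has non-zero weights on positions <= i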

class GPT2ParallelSelfAttention(torch.nn.Module):
    """Parallel self-attention layer for GPT2.

    Self-attention layer takes input with size [b, s, h] where b is
    the batch size, s is the sequence length, and h is the hidden size
    and creates output of the same size.

    Arguments:
        hidden_size: total hidden size of the layer (h).
        num_attention_heads: number of attention heads (n). Note that we
                             require n to be divisible by number of GPUs
                             used to parallelize the model. Also, we
                             require hidden size to be divisible by n.
        dropout_prob: dropout probability for the attention scores.
        init_method: weight initialization.
        output_layer_init_method: output layer initialization. If None, use
                                  `init_method`.

    We use the following notation:
        h:  hidden_size
        n:  num_attention_heads
        p:  number of partitions (p = 1 when the model is not sharded)
        np: n/p = n  (when p = 1)
        hp: h/p = h  (when p = 1)
        hn: h/n, the hidden size of each attention head
        b:  batch size
        s:  sequence length
    """

    def __init__(self, hidden_size, num_attention_heads,
                 attention_dropout_prob, output_dropout_prob,
                 init_method, output_layer_init_method=None):
        super(GPT2ParallelSelfAttention, self).__init__()
        # Set output layer initialization if not provided.
        if output_layer_init_method is None:
            output_layer_init_method = init_method
        # Per attention head and per partition values.
        world_size = get_model_parallel_world_size()
        # Model sharding: when the model is not sharded, world_size=1 and nothing changes.
        self.hidden_size_per_partition = divide(hidden_size, world_size)
        # Hidden size of each attention head.
        # E.g. hidden_size=256 with 8 attention heads gives 256/8 = 32 per head.
        # (This is just the definition of multi-head attention; it has nothing
        # to do with model sharding.)
        self.hidden_size_per_attention_head = divide(hidden_size,
                                                     num_attention_heads)
        # Number of attention heads per model shard. With world_size=1 this is
        # simply the num_attention_heads passed in. If world_size=2, i.e. the
        # model is split across 2 GPUs, each shard handles
        # (num_attention_heads / 2) heads.
        self.num_attention_heads_per_partition = divide(num_attention_heads,
                                                        world_size)
        # Strided linear layer.
        # ColumnParallelLinear is a linear layer that supports model sharding;
        # essentially it is just y = x * W + b.
        # It maps the input from hidden_size to 3*hidden_size.
        # When the model is not sharded, the stride and gather_output arguments
        # have no effect. (Knowing what the op does is enough; I won't dig into
        # it here.)
        self.query_key_value = ColumnParallelLinear(hidden_size, 3*hidden_size,
                                                    stride=3,
                                                    gather_output=False,
                                                    init_method=init_method)
        # Dropout. Note that for a single iteration, this layer will generate
        # different outputs on different number of parallel partitions but
        # on average it should not be partition dependent.
        # The author's note: dropout on the attention scores differs across
        # model shards within one iteration, but on average it should not
        # depend on how the model is partitioned.
        self.attention_dropout = torch.nn.Dropout(attention_dropout_prob)

        # Output.
        # Another linear transform; its weight has shape [h, h].
        self.dense = RowParallelLinear(hidden_size,
                                       hidden_size,
                                       input_is_parallel=True,
                                       init_method=output_layer_init_method)
        self.output_dropout = torch.nn.Dropout(output_dropout_prob)

        if deepspeed.checkpointing.is_configured():
            global get_cuda_rng_tracker, checkpoint
            get_cuda_rng_tracker = deepspeed.checkpointing.get_cuda_rng_tracker
            checkpoint = deepspeed.checkpointing.checkpoint

    def _transpose_for_scores(self, tensor):
        """Transpose a 3D tensor [b, s, np*hn] into a 4D tensor with
        size [b, np, s, hn].

        Without sharding, np = n and hn = h/n, so the 3D tensor is really
        [b, s, h]; it is split by attention head into [b, n, s, hn].
        """
        # Target shape: (b, s) + (np, hn) = (b, s, np, hn).
        new_tensor_shape = tensor.size()[:-1] + \
                           (self.num_attention_heads_per_partition,
                            self.hidden_size_per_attention_head)
        # Reshape into the target shape.
        tensor = tensor.view(*new_tensor_shape)
        return tensor.permute(0, 2, 1, 3)

    def forward(self, hidden_states, ltor_mask):
        # hidden_states: [b, s, h]
        # ltor_mask: [1, 1, s, s]

        # Attention heads. [b, s, hp]
        # With p=1 this is just [b, s, h].
        # query_key_value: [b, s, h] -> [b, s, 3*h]
        # Because this is self-attention, the query/key/value projections are
        # all applied to hidden_states: [b, s, h] * [h, 3h] -> [b, s, 3*h].
        # The q, k, v projections are fused into one matmul; the last dimension
        # is split afterwards.
        mixed_x_layer = self.query_key_value(hidden_states)
        # split_tensor_along_last_dim splits the last dimension into n equal
        # parts, here 3, which separates q, k and v from the fused projection.
        # q, k and v each have shape [b, s, h].
        (mixed_query_layer,
         mixed_key_layer,
         mixed_value_layer) = split_tensor_along_last_dim(mixed_x_layer, 3)

        # Reshape and transpose [b, np, s, hn]
        # Split q, k, v by attention head; np * hn = h (when p=1).
        query_layer = self._transpose_for_scores(mixed_query_layer)
        key_layer = self._transpose_for_scores(mixed_key_layer)
        value_layer = self._transpose_for_scores(mixed_value_layer)

        # Raw attention scores. [b, np, s, s]
        # q * k^T gives the attention scores.
        attention_scores = torch.matmul(query_layer,
                                        key_layer.transpose(-1, -2))
        # q * k^T / sqrt(hn)
        attention_scores = attention_scores / math.sqrt(
            self.hidden_size_per_attention_head)

        # Apply the left to right attention mask.
        # attention_scores has shape [b, np, s, s] and ltor_mask has shape
        # [1, 1, s, s]; the mask is all 1 in the lower triangle and all 0 above
        # the diagonal. The two are multiplied elementwise (Hadamard product),
        # which keeps only the scores for positions at or before the current
        # token; later positions get -10000, i.e. a very small value.
        attention_scores = torch.mul(attention_scores, ltor_mask) - \
                           10000.0 * (1.0 - ltor_mask)

        # Attention probabilities. [b, np, s, s]
        # Softmax over the last dimension gives the attention probabilities
        # for every position.
        attention_probs = torch.nn.Softmax(dim=-1)(attention_scores)
        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        # The author notes that this dropout drops whole attention weights,
        # which looks a bit odd. (Indeed... I'll leave this as an open question.)
        with get_cuda_rng_tracker().fork():
            attention_probs = self.attention_dropout(attention_probs)

        # Context layer.
        # [b, np, s, hn]
        # Weighted sum: [b, np, s, s] * [b, np, s, hn] -> [b, np, s, hn].
        context_layer = torch.matmul(attention_probs, value_layer)
        # [b, s, np, hn]
        # Merge the heads back: first permute the dimensions back ...
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        # ... then compute the merged shape (b, s) + (h,) = (b, s, h),
        # still assuming the model is not sharded.
        new_context_layer_shape = context_layer.size()[:-2] + \
                                  (self.hidden_size_per_partition,)
        # [b, s, hp] reshape into the merged shape.
        context_layer = context_layer.view(*new_context_layer_shape)

        # Output. [b, s, h]
        # Final dense + dropout. The dense layer is the RowParallelLinear
        # defined above; with a single model shard this is
        # [b, s, h] * [h, h] -> [b, s, h].
        output = self.dense(context_layer)
        output = self.output_dropout(output)

        return output
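A quick shape check for _transpose_for_scores with toy sizes (hidden_size=256 and 8 heads, so each head gets 32 dimensions; all values are made up for illustration):

import torch

b, s, h, n = 2, 4, 256, 8        # toy batch, seq length, hidden size, heads
hn = h // n                      # per-head hidden size = 32

x = torch.randn(b, s, h)         # [b, s, h] = [2, 4, 256]
# Split the last dimension into heads: [b, s, n, hn] ...
x = x.view(b, s, n, hn)
# ... then move the head dimension forward: [b, n, s, hn].
x = x.permute(0, 2, 1, 3)
print(x.shape)                   # torch.Size([2, 8, 4, 32])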

2.5 GPT2MLP Module

GPT-2's MLP module simply applies a non-linear transformation along the hidden dimension: h -> 4h -> h. A plain-PyTorch sketch of the same computation comes right after this paragraph, followed by the actual module.
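Stripped of the parallel linear layers, the computation is just two nn.Linear layers around a GELU. This toy version (my own stand-in, not the mpu code) computes exactly h -> 4h -> GELU -> h -> dropout:

import torch

class ToyGPT2MLP(torch.nn.Module):
    """What GPT2ParallelMLP computes when the model is not sharded."""

    def __init__(self, hidden_size, dropout_prob=0.1):
        super().__init__()
        self.dense_h_to_4h = torch.nn.Linear(hidden_size, 4 * hidden_size)
        self.dense_4h_to_h = torch.nn.Linear(4 * hidden_size, hidden_size)
        self.dropout = torch.nn.Dropout(dropout_prob)

    def forward(self, hidden_states):            # [b, s, h]
        x = self.dense_h_to_4h(hidden_states)    # [b, s, 4h]
        x = torch.nn.functional.gelu(x)
        x = self.dense_4h_to_h(x)                # [b, s, h]
        return self.dropout(x)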

class GPT2ParallelMLP(torch.nn.Module):
    """MLP for GPT2.

    MLP will take the input with h hidden state, project it to 4*h
    hidden dimension, perform gelu transformation, and project the
    state back into h hidden dimension. At the end, dropout is also
    applied.

    Arguments:
        hidden_size: The hidden size of the self attention.
        output_dropout_prob: dropout probability for the outputs
                             after self attention and final output.
        init_method: initialization method used for the weights. Note
                     that all biases are initialized to zero and
                     layernorm weight are initialized to one.
        output_layer_init_method: output layer initialization. If None,
                                  use `init_method`.
    """

    def __init__(self, hidden_size, output_dropout_prob, init_method,
                 output_layer_init_method=None):
        super(GPT2ParallelMLP, self).__init__()
        # Set output layer initialization if not provided.
        if output_layer_init_method is None:
            output_layer_init_method = init_method
        # Project to 4h.
        # As before, don't be intimidated by the name: this is just y = x*W + b,
        # computing [b, s, h] * [4h, h]^T -> [b, s, 4h].
        # The weight shape is [output, input]; if the model is sharded, the
        # output dimension is the one that gets split.
        self.dense_h_to_4h = ColumnParallelLinear(hidden_size, 4*hidden_size,
                                                  gather_output=False,
                                                  init_method=init_method)
        # Project back to h.
        # y = x*W + b again: [b, s, 4h] * [h, 4h]^T -> [b, s, h].
        # The weight shape is [output, input]; if the model is sharded, the
        # input dimension is the one that gets split.
        self.dense_4h_to_h = RowParallelLinear(
            4*hidden_size,
            hidden_size,
            input_is_parallel=True,
            init_method=output_layer_init_method)
        self.dropout = torch.nn.Dropout(output_dropout_prob)

    def forward(self, hidden_states):
        # [b, s, 4hp]
        intermediate_parallel = self.dense_h_to_4h(hidden_states)
        intermediate_parallel = gelu(intermediate_parallel)
        # [b, s, h]
        output = self.dense_4h_to_h(intermediate_parallel)
        output = self.dropout(output)
        return output

That covers the code for GPT-2's main model structure.

3. GPT-2 Model Pretraining

Next, let's look at the forward pass to see how the loss function is constructed when pretraining GPT-2.

This code lives in pretrain_gpt2.py.

3.1 GPT2 Pretraining - Building the Model

def get_model(args):
    """Build the model."""

    print_rank_0('building GPT2 model ...')
    # This builds the GPT2Model discussed in detail in Section 2.
    model = GPT2Model(num_layers=args.num_layers,
                      vocab_size=args.vocab_size,
                      hidden_size=args.hidden_size,
                      num_attention_heads=args.num_attention_heads,
                      embedding_dropout_prob=args.hidden_dropout,
                      attention_dropout_prob=args.attention_dropout,
                      output_dropout_prob=args.hidden_dropout,
                      max_sequence_length=args.max_position_embeddings,
                      checkpoint_activations=args.checkpoint_activations,
                      checkpoint_num_layers=args.checkpoint_num_layers,
                      parallel_output=True)

    if mpu.get_data_parallel_rank() == 0:
        print(' > number of parameters on model parallel rank {}: {}'.format(
            mpu.get_model_parallel_rank(),
            sum([p.nelement() for p in model.parameters()])), flush=True)

    # To prevent OOM for model sizes that cannot fit in GPU memory in full precision
    # With deepspeed and fp16, fp32 is only used for the weight update; the
    # expensive forward and backward passes run in fp16.
    # half() converts the model's float32 parameters to float16.
    if args.deepspeed and args.fp16:
        model.half()

    # GPU allocation.
    # Explicitly move the model onto the GPU.
    model.cuda(torch.cuda.current_device())

    # Fp16 conversion.
    # fp16 mixed precision saves a lot of memory; it deserves its own
    # write-up, so I won't expand on it here.
    if args.fp16:
        model = FP16_Module(model)

    # Wrap model for distributed training.
    if USE_TORCH_DDP:
        i = torch.cuda.current_device()
        model = DDP(model, device_ids=[i], output_device=i,
                    process_group=mpu.get_data_parallel_group())
    else:
        model = DDP(model)

    return model

3.2 GPT2 Pretraining - forward

This is the forward step used during pretraining.

The flow is simple: run the model's forward to get GPT-2's output, then compute the loss. The input is sentence[:-1] and the true labels are sentence[1:]; that is, for a sequence of length seq_len, tokens 1 through seq_len-1 are the input and tokens 2 through seq_len are the labels. A minimal illustration of this shift-by-one setup follows below, and then the actual forward_step.
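A minimal, self-contained illustration of the shift-by-one language-modeling loss, using plain F.cross_entropy in place of mpu.vocab_parallel_cross_entropy and toy sizes:

import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 6
sequence = torch.randint(0, vocab_size, (1, seq_len))   # one toy sentence

tokens = sequence[:, :-1]     # input:  positions 1 .. seq_len-1
labels = sequence[:, 1:]      # labels: positions 2 .. seq_len (shifted by one)

# Pretend model output: one logit vector over the vocabulary per input position.
logits = torch.randn(1, seq_len - 1, vocab_size)

# Mask that zeroes out positions we don't want to train on (e.g. end/padding tokens).
loss_mask = torch.ones(1, seq_len - 1)

losses = F.cross_entropy(logits.view(-1, vocab_size), labels.reshape(-1),
                         reduction='none')
loss = torch.sum(losses * loss_mask.view(-1)) / loss_mask.sum()
print(loss)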

def forward_step(data_iterator, model, args, timers):
    """Forward step."""

    # Get the batch.
    timers('batch generator').start()
    tokens, labels, loss_mask, attention_mask, position_ids = get_batch(
        data_iterator, args, timers)
    timers('batch generator').stop()

    # Forward model.
    # output shape = [b, s, vocab_size]
    # The output at each position of the sequence can be read as the
    # prediction of the next token, given the n tokens seen so far.
    output = model(tokens, position_ids, attention_mask)
    # Compute the cross entropy between the output logits (over the vocabulary
    # in the last dimension) and the labels.
    losses = mpu.vocab_parallel_cross_entropy(output.contiguous().float(),
                                              labels)
    # loss_mask masks out the end tokens.
    loss_mask = loss_mask.view(-1)
    loss = torch.sum(losses.view(-1) * loss_mask) / loss_mask.sum()

    return loss

References

完全图解GPT-2:看完这篇就够了(一)
预训练模型专题_GPT2_模型代码学习笔记 (this blogger wrote reading notes for the huggingface GPT-2 code; worth studying alongside this post)

